510 research outputs found

    Improving the Caenorhabditis elegans Genome Annotation Using Machine Learning

    For modern biology, precise genome annotations are of prime importance, as they allow the accurate definition of genic regions. We employ state-of-the-art machine learning methods to assay and improve the accuracy of the genome annotation of the nematode Caenorhabditis elegans. The proposed machine learning system is trained to recognize exons and introns on the unspliced mRNA, utilizing recent advances in support vector machines and label sequence learning. In 87% (coding and untranslated regions) and 95% (coding regions only) of all genes tested in several out-of-sample evaluations, our method correctly identified all exons and introns. Notably, only 37% and 50%, respectively, of the presently unconfirmed genes in the C. elegans genome annotation agree with our predictions; we therefore hypothesize that a sizable fraction of those genes are not correctly annotated. A retrospective evaluation of the Wormbase WS120 annotation [1] of C. elegans reveals that splice form predictions on unconfirmed genes in WS120 are inaccurate in about 18% of the considered cases, while our predictions deviate from the truth in only 10%–13%. We experimentally analyzed 20 controversial genes on which our system and the annotation disagree, confirming the superiority of our predictions. While our method correctly predicted 75% of those cases, the standard annotation was never completely correct. The accuracy of our system is further corroborated by a comparison with two other recently proposed systems that can be used for splice form prediction: SNAP and ExonHunter. We conclude that the genome annotation of C. elegans and other organisms can be greatly enhanced using modern machine learning technology.

    Genome-wide identification and predictive modeling of tissue-specific alternative polyadenylation

    MOTIVATION: Pre-mRNA cleavage and polyadenylation are essential steps for 3'-end maturation and subsequent stability and degradation of mRNAs. This process is highly controlled by cis-regulatory elements surrounding the cleavage/polyadenylation sites (polyA sites), which are frequently constrained by sequence content and position. More than 50% of human transcripts have multiple functional polyA sites, and the specific use of alternative polyA sites (APA) results in isoforms with variable 3'-untranslated regions, thus potentially affecting gene regulation. Elucidating the regulatory mechanisms underlying differential polyA preferences in multiple cell types has been hindered both by the lack of suitable data on the precise location of cleavage sites and by the lack of appropriate tests for determining APAs with significant differences across multiple libraries. RESULTS: We applied a tailored paired-end RNA-seq protocol to specifically probe the position of polyA sites in three human adult tissue types. We specified a linear-effects regression model to identify tissue-specific biases indicating regulated APA; the significance of differences between tissue types was assessed by an appropriately designed permutation test. This combination allowed us to identify highly specific subsets of APA events in the individual tissue types. Predictive models successfully distinguished constitutive polyA sites from a biologically relevant background (auROC = 99.6%), as well as tissue-specific regulated sets from each other. We found that the main cis-regulatory elements described for polyadenylation are a strong, and highly informative, hallmark for constitutive sites only. Tissue-specific regulated sites were found to contain other regulatory motifs, with the canonical polyadenylation signal being nearly absent at brain-specific polyA sites. Together, our results contribute to the understanding of the diversity of post-transcriptional gene regulation.
AVAILABILITY: Raw data are deposited on SRA, accession numbers: brain SRX208132, kidney SRX208087 and liver SRX208134. Processed datasets as well as model code are published on our website: http://www.genome.duke.edu/labs/ohler/research/UTR/
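
The abstract above assesses tissue differences with a permutation test. The following is a minimal sketch of a generic two-sample permutation test, not the authors' linear-effects model or their test design. The function name `permutation_pvalue`, the use of a difference in means as the test statistic, and the idea of feeding it per-replicate usage fractions of one polyA site are all assumptions made for illustration.

```python
import random
from statistics import mean


def permutation_pvalue(group_a, group_b, n_perm=10000, seed=0):
    """Two-sided permutation p-value for a difference in group means.

    group_a, group_b: hypothetical usage fractions of one polyA site
    in two tissues (one value per replicate or library).
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_perm):
        # Shuffle tissue labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)
```

A small p-value from such a test would suggest that the observed difference is unlikely under random label assignment; with many sites tested, a multiple-testing correction would also be needed.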

    Large-Scale Discovery of Promoter Motifs in Drosophila melanogaster

    A key step in understanding gene regulation is to identify the repertoire of transcription factor binding motifs (TFBMs) that form the building blocks of promoters and other regulatory elements. Identifying these experimentally is very laborious, and the number of TFBMs discovered remains relatively small, especially when compared with the hundreds of transcription factor genes predicted in metazoan genomes. We have used a recently developed statistical motif discovery approach, NestedMICA, to detect candidate TFBMs from a large set of Drosophila melanogaster promoter regions. Of the 120 motifs inferred in our initial analysis, 25 were statistically significant matches to previously reported motifs, while 87 appeared to be novel. Analysis of sequence conservation and motif positioning suggested that the great majority of these discovered motifs are predictive of functional elements in the genome. Many motifs showed associations with specific patterns of gene expression in the D. melanogaster embryo, and we were able to obtain confident annotation of expression patterns for 25 of our motifs, including eight of the novel motifs. The motifs are available through Tiffin, a new database of DNA sequence motifs. We have discovered many new motifs that are overrepresented in D. melanogaster promoter regions, and offer several independent lines of evidence that these are novel TFBMs. Our motif dictionary provides a solid foundation for further investigation of regulatory elements in Drosophila, and demonstrates techniques that should be applicable in other species. We suggest that further improvements in computational motif discovery should narrow the gap between the set of known motifs and the total number of transcription factors in metazoan genomes

    Distinct polyadenylation landscapes of diverse human tissues revealed by a modified PA-seq strategy

    Background: Polyadenylation is a key regulatory step in eukaryotic gene expression and one of the major contributors to transcriptome diversity. Aberrant polyadenylation often associates with expression defects and leads to human diseases. Results: To better understand global polyadenylation regulation, we have developed a polyadenylation sequencing (PA-seq) approach. By profiling polyadenylation events in 13 human tissues, we found that alternative cleavage and polyadenylation (APA) is prevalent in both protein-coding and noncoding genes. In addition, APA usage, similar to gene expression profiling, exhibits tissue-specific signatures and is sufficient for determining tissue origin. A 3′ untranslated region shortening index (USI) was further developed for genes with tandem APA sites. Strikingly, the results showed that different tissues exhibit distinct patterns of shortening and/or lengthening of 3′ untranslated regions, suggesting the intimate involvement of APA in establishing tissue or cell identity. Conclusions: This study provides a comprehensive resource to uncover regulated polyadenylation events in human tissues and to characterize the underlying regulatory mechanism.

    Deep learning for prediction of population health costs

    BACKGROUND: Accurate prediction of healthcare costs is important for optimally managing health costs. However, methods leveraging the medical richness from data such as health insurance claims or electronic health records are missing. METHODS: Here, we developed a deep neural network to predict future cost from health insurance claims records. We applied the deep network and a ridge regression model to a sample of 1.4 million German insurants to predict total one-year health care costs. Both methods were compared to existing models with various performance measures and were also used to predict patients with a change in costs and to identify relevant codes for this prediction. RESULTS: We showed that the neural network outperformed the ridge regression as well as all considered models for cost prediction. Further, the neural network was superior to ridge regression in predicting patients with cost change and identified more specific codes. CONCLUSION: In summary, we showed that our deep neural network can leverage the full complexity of the patient records and outperforms standard approaches. We suggest that the better performance is due to the ability to incorporate complex interactions in the model and that the model might also be used for predicting other health phenotypes

    Force-clamp analysis techniques reveal stretched exponential unfolding kinetics in ubiquitin

    Force-clamp spectroscopy reveals the unfolding and disulfide bond rupture times of single protein molecules as a function of the stretching force, point mutations and solvent conditions. The statistics of these times reveal whether the protein domains are independent of one another, the mechanical hierarchy in the polyprotein chain, and the functional form of the probability distribution from which they originate. It is therefore important to use robust statistical tests to decipher the correct theoretical model underlying the process. Here we develop multiple techniques to compare the well-established experimental data set on ubiquitin with existing theoretical models as a case study. We show that robustness against filtering, agreement with a maximum likelihood function that takes into account experimental artifacts, the Kuiper statistic test and alignment with synthetic data all identify the Weibull or stretched exponential distribution as the best fitting model. Our results are inconsistent with recently proposed models of Gaussian disorder in the energy landscape or noise in the applied force as explanations for the observed non-exponential kinetics. Since the physical model in the fit affects the characteristic unfolding time, these results have important implications on our understanding of the biological function of proteins
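
One of the checks named in the abstract above is the Kuiper statistic. The following is a minimal sketch of how that statistic can be computed for a set of unfolding times against a candidate distribution. The parameterization of the stretched exponential (Weibull) CDF as 1 - exp(-(t/tau)^beta), and the names `weibull_cdf`, `kuiper_statistic`, `tau`, and `beta`, are assumptions for illustration and are not taken from the paper.

```python
import math


def weibull_cdf(t, tau, beta):
    """Stretched exponential (Weibull) CDF for t >= 0.

    beta = 1 reduces to a plain exponential with time constant tau.
    """
    return 1.0 - math.exp(-((t / tau) ** beta))


def kuiper_statistic(samples, cdf):
    """Kuiper statistic V = D+ + D- between samples and a model CDF."""
    xs = sorted(samples)
    n = len(xs)
    # Largest amount by which the empirical CDF exceeds the model ...
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    # ... and by which the model exceeds the empirical CDF.
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return d_plus + d_minus
```

Comparing `kuiper_statistic(times, lambda t: weibull_cdf(t, tau, beta))` for a fitted `beta` with the same call at `beta = 1.0` gives a rough sense of whether a single exponential describes the data; smaller values indicate closer agreement. Turning the statistic into a significance level requires its null distribution, which this sketch does not cover.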

    Electric fields and valence band offsets at strained [111] heterojunctions

    [111] ordered common atom strained layer superlattices (in particular the common anion GaSb/InSb system and the common cation InAs/InSb system) are investigated using the ab initio full potential linearized augmented plane wave (FLAPW) method. We have focused our attention on the potential line-up at the two sides of the homopolar isovalent heterojunctions considered, and in particular on its dependence on the strain conditions and on the strain induced electric fields. We propose a procedure to locate the interface plane where the band alignment could be evaluated; furthermore, we suggest that the polarization charges, due to piezoelectric effects, are approximately confined to a narrow region close to the interface and do not affect the potential discontinuity. We find that the interface contribution to the valence band offset is substantially unaffected by strain conditions, whereas the total band line-up is highly tunable, as a function of the strain conditions. Finally, we compare our results with those obtained for [001] heterojunctions. Comment: 18 pages, LaTeX file, to appear in Phys. Rev.

    Control of immediate early gene expression by CPEB4-repressor complex-mediated mRNA degradation

    BACKGROUND: Cytoplasmic polyadenylation element-binding protein 4 (CPEB4) is known to associate with cytoplasmic polyadenylation elements (CPEs) located in the 3' untranslated region (UTR) of specific mRNAs and assemble an activator complex promoting the translation of target mRNAs through cytoplasmic polyadenylation. RESULTS: Here, we find that CPEB4 is part of an alternative repressor complex that mediates mRNA degradation by associating with the evolutionarily conserved CCR4-NOT deadenylase complex. We identify human CPEB4 as an RNA-binding protein (RBP) with enhanced association to poly(A) RNA upon inhibition of class I histone deacetylases (HDACs), a condition known to cause widespread degradation of poly(A)-containing mRNA. Photoactivatable ribonucleoside-enhanced crosslinking and immunoprecipitation (PAR-CLIP) analysis using endogenously tagged CPEB4 in HeLa cells reveals that CPEB4 preferentially binds to the 3'UTR of immediate early gene mRNAs, at G-containing variants of the canonical U- and A-rich CPE located in close proximity to poly(A) sites. By transcriptome-wide mRNA decay measurements, we find that the strength of CPEB4 binding correlates with short mRNA half-lives and that loss of CPEB4 expression leads to the stabilization of immediate early gene mRNAs. Akin to CPEB4, we demonstrate that CPEB1 and CPEB2 also confer mRNA instability by recruitment of the CCR4-NOT complex. CONCLUSIONS: While CPEB4 was previously known for its ability to stimulate cytoplasmic polyadenylation, our findings establish an additional function for CPEB4 as the RNA adaptor of a repressor complex that enhances the degradation of short-lived immediate early gene mRNAs

    Features of mammalian microRNA promoters emerge from polymerase II chromatin immunoprecipitation data

    Background: MicroRNAs (miRNAs) are short, non-coding RNA regulators of protein coding genes. miRNAs play a very important role in diverse biological processes and various diseases. Many algorithms are able to predict miRNA genes and their targets, but their transcription regulation is still under investigation. It is generally believed that intragenic miRNAs (located in introns or exons of protein coding genes) are co-transcribed with their host genes and that most intergenic miRNAs are transcribed from their own RNA polymerase II (Pol II) promoters. However, the length of the primary transcripts and the organization of their promoters are currently unknown. Methodology: We performed Pol II chromatin immunoprecipitation (ChIP)-chip using a custom array surrounding regions of known miRNA genes. To identify the true core transcription start sites of the miRNA genes we developed a new tool (CPPP). We showed that miRNA genes can be transcribed from promoters located several kilobases away and that their promoters share the same general features as those of protein coding genes. Finally, we found evidence that as many as 26% of the intragenic miRNAs may be transcribed from their own unique promoters. Conclusion: miRNA promoters have similar features to those of protein coding genes, but miRNA transcript organization is more complex. © 2009 Corcoran et al.